Abstract

In this review, we discuss the latest targeted enrichment methods and aspects of their utilization along with 
second-generation sequencing for complex genome analysis. In doing so, we provide an overview of issues involved 
in detecting genetic variation, for which targeted enrichment has become a powerful tool. We explain how targeted 
enrichment for next-generation sequencing has made great progress in terms of methodology, ease of use and ap- 
plicability, but emphasize the remaining challenges such as the lack of even coverage across targeted regions. Costs 
are also considered versus the alternative of whole-genome sequencing which is becoming ever more affordable. 
We conclude that targeted enrichment is likely to be the most economical option for many years to come in a 
range of settings. 
INTRODUCTION

Next-generation sequencing (NGS) [1, 2] is now a 
major driver in genetics research, providing a power- 
ful way to study DNA or RNA samples. New and 
improved methods and protocols have been de- 
veloped to support a diverse range of applications, 
including the analysis of genetic variation. As part of 
this, methods have been developed that aim to 
achieve 'targeted enrichment' of genome subregions
[3, 4], also sometimes referred to as 'genome partitioning'.
Strategies for direct selection of genomic
regions were already developed in anticipation of
the introduction of NGS [5, 6]. By selective
recovery and subsequent sequencing of genomic loci
of interest, costs and efforts can be reduced signifi- 
cantly compared with whole-genome sequencing. 

Targeted enrichment can be useful in a number 
of situations where particular portions of a 
whole genome need to be analyzed [7]. Efficient
sequencing of the complete 'exome' (all transcribed
sequences) represents a major current application,
but researchers are also focusing their experiments
on far smaller sets of genes or genomic regions
potentially implicated in complex diseases [e.g.
derived from genome-wide association studies
(GWAS)], pharmacogenetics, pathway analysis and
so on [1, 8, 9]. For identifying the causes of monogenic
diseases, exome sequencing can be a powerful tool [10].
Across all these areas of study, a typical objective is 
the analysis of genetic variation within defined 
cohorts and populations. 

Targeted enrichment techniques can be charac- 
terized via a range of technical considerations related 
to their performance and ease of use, but the prac- 
tical importance of any one parameter may vary de- 
pending on the methodological approach applied 
and the scientific question being asked. Arguably, 
the most important features of a method, which in 
turn reflect the biggest challenges in targeted enrich- 
ment, include: enrichment factor, ratio of sequence 
reads on/off target region (specificity), coverage (read 
depth), evenness of coverage across the target region, 
method reproducibility, required amount of input 
DNA and overall cost per target base of useful 
sequence data. 

Within this review, we compare and contrast 
the most commonly used techniques for targeted 
enrichment of nucleic acids for NGS analysis. 
Additionally, we consider issues around the use of 
such methods for the detection of genetic variation, 
and some general points regarding the design of the 
target region, input DNA sample preparation and the 
output analysis. 



ENRICHMENT TECHNIQUES 

Current techniques for targeted enrichment can be 
categorized according to the nature of their core re- 
action principle (Figure 1): 

(i) 'Hybrid capture': wherein nucleic acid strands 
derived from the input sample are hybridized 
specifically to preprepared DNA fragments com- 
plementary to the targeted regions of interest, 
either in solution or on a solid support, so that 
one can physically capture and isolate the 
sequences of interest; 

(ii) 'Selective circularization': also called molecular 
inversion probes (MIPs), gap-fill padlock probes
and selector probes, wherein single-stranded
DNA circles that include target region se- 
quences are formed (by gap-filling and ligation 
chemistries) in a highly specific manner, creating 
structures with common DNA elements that are 
then used for selective amplification of the tar- 
geted regions of interest; 

(iii) 'PCR amplification': wherein polymerase chain
reaction (PCR) is directed toward the targeted 
regions of interest by conducting multiple 
long-range PCRs in parallel, a limited number 
of standard multiplex PCRs or highly multi- 
plexed PCR methods that amplify very large 
numbers of short fragments. 

Given the operational characteristics of these dif- 
ferent targeted enrichment methods, they naturally 
vary in their suitability for different fields of application.
For example, where many megabases need
to be analyzed (e.g. the exome), hybrid capture
approaches are attractive as they can handle large 
target regions, even though they achieve suboptimal 
enrichment over the complete region of interest. 
In contrast, when small target regions need to be 
examined, especially in many samples, PCR-based 
approaches may be preferred as they enable a deep 
and even coverage over the region of interest, suitable
for genetic variation analysis.

An overview of these different approaches is pre- 
sented in Figure 1, and Table 1 lists the most 
common methods along with additional 
information. 

Basic considerations for targeted 
enrichment experiments 

The design of a targeted enrichment experiment 
begins with a general consideration of the target 
region of interest. In particular, a major obstacle for 
targeted enrichment is posed by repetitive elements,
including interspersed and tandem repeats as well as
elements such as pseudogenes located within and
outside the region of interest. Exclusion of
repeat-masked elements [11] from the targeted region is a
straightforward and efficient way to reduce the re- 
covery of undesirable products due to repeats. 
Furthermore, at extreme values (<25% or >65%), 
the guanine-cytosine (GC) content of the target 
region has a considerable impact on the evenness 
and efficiency of the enrichment [12]. This can 
adversely affect the enrichment of the 5'-UTR/
promoter region and the first exon of genes, which
are often GC rich [13]. Therefore, expectations
regarding the outcome of the experiment require
careful evaluation in terms of the precise target
region in conjunction with the appropriate enrichment
method.

Figure 1: Commonly used targeted enrichment techniques. (1) Hybrid capture targeted enrichment either on a solid
support such as microarrays (a) or in solution (b). A shot-gun fragment library is prepared and hybridized against a
library containing the target sequence. After hybridization (and bead coupling), nontarget sequences are washed
away, and the enriched sample can be eluted and further processed for sequencing. (2) Enrichment by MIPs, which are
composed of a universal sequence (blue) flanked by target-specific sequences. MIPs are hybridized to the region of
interest, followed by a gap-filling reaction and ligation to produce closed circles. The classical MIPs are hybridized
to mechanically sheared DNA (a), whereas the Selector Probe technique uses a restriction enzyme cocktail to fragment
the DNA and the probes are adapted to the restriction pattern (b). (3) Targeted enrichment by differing PCR
approaches. Typical PCR with a single-tube-per-fragment assay (a), multiplex PCR assay with up to 50 fragments (b)
and RainDance micro-droplet PCR with up to 20 000 unique primer pairs (c) utilized for targeted enrichment.
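
As an illustration of the GC-content consideration discussed above, a candidate target sequence can be screened for windows whose GC fraction falls outside the 25-65% range quoted in the text. The following is only a minimal sketch: the function names, the 100-base window size and the use of non-overlapping windows are assumptions made for this example and are not part of any published design pipeline.

```python
def gc_fraction(seq):
    """Return the GC fraction among unambiguous bases (A, C, G, T).

    Returns None if the sequence contains no unambiguous bases.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else None


def flag_extreme_gc(seq, window=100, low=0.25, high=0.65):
    """Yield (start, end, gc) for non-overlapping windows with extreme GC.

    Coordinates are 0-based and half-open. The thresholds default to the
    <25% / >65% limits mentioned in the text; the window size is an
    arbitrary choice for this sketch.
    """
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        gc = gc_fraction(chunk)
        if gc is None:
            continue  # window consists only of ambiguous bases (e.g. N)
        if gc < low or gc > high:
            yield start, start + len(chunk), gc
```

Windows flagged in this way would be candidates for closer inspection or for adjusted expectations regarding coverage, rather than for automatic exclusion.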

The performance of a targeted enrichment ex- 
periment will also depend upon the mode and qual- 
ity of processing of the input DNA sample. Having 
sufficient high-quality DNA is key for any further 
downstream handling. When limited genomic DNA 
is available, whole-genome amplification (WGA) is 
usually applied. Since WGA produces only a representation
and not a replica of the genome, a bias is
assumed to be introduced, though the impact of this
on the final results can be compensated for, to a degree,
by identically manipulating control samples [14].



All three major targeted enrichment techniques 
(hybrid capture, circularization and PCR) differ in 
terms of sample library preparation workflow enabl- 
ing sequencing on any of the current NGS instru- 
ments (e.g. Illumina, Roche 454 and SOLiD). 
Enrichment by hybrid selection relies on short-fragment
library preparations (typically ranging from 100
to 250 bp), which are generated before hybridization
to the synthetic library comprising the target region.
In contrast, enrichment by PCR is performed directly
on genomic DNA, and the library primers for
sequencing are added thereafter. Enrichment by
circularization offers the easiest library preparation for
NGS because the sequencing primers can be added 
to the circularization probe, thus eliminating the 
need for any further library preparation steps. 
Sequencing can be performed either as single-read or
paired-end reads of the fragment library. In general,
mate-pair libraries are not used for hybridization-based
targeted enrichments due to the extra complications
this implies in terms of target region design.

In general, a single NGS run produces enough 
reads to sequence several samples enriched by one 
of the mentioned methods. Therefore, pooling stra- 
tegies and indexing approaches are a practical way 
to reduce the per sample cost. Depending on the 
method used for targeted enrichment, different 
multiplexing strategies can be envisaged that enable 
multiplexing in different stages of the enrichment 
process: before, during and after the enrichment. 
For targeted enrichment by hybrid capture, indexing 
of the sample is usually performed after the enrich- 
ment but to reduce the number of enrichment reac- 
tions, the sample libraries can alternatively be 
indexed during the library preparations and then 
pooled for enrichment [15]. Enrichment by PCR 
and circularization offers indexing during the enrich- 
ment by using bar-coded primers in the product 
amplification steps [16]. Furthermore, two multiplexing
strategies can be combined in a single experiment.
First, multiple samples can be enriched as a
pool, with each harboring a unique pre-added
bar-code. Second, another bar-coding procedure
can be applied postenrichment to each of these
pools, giving rise to a highly multiplexed final pool.
If such extensive multiplexing is used, great care 
must be taken to normalize the amount of each 
sample within the pool to achieve sufficiently even 
representation over all samples in the final set of se- 
quence reads. In addition, highly complex pooling 
strategies also imply far greater challenges when it 
comes to deconvoluting the final sequence data 
back into the original samples. 
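
The deconvolution step for such a two-level bar-coding scheme can be sketched as follows. This is an illustrative example only: it assumes that the postenrichment pool index is reported separately for each read, that the pre-added sample bar-code sits at the start of the read, that all sample bar-codes have the same length and that only exact matches are accepted. The function and argument names are hypothetical.

```python
from collections import defaultdict


def demultiplex(reads, pool_index_to_pool, sample_barcode_to_sample):
    """Assign reads to (pool, sample) bins using two bar-code levels.

    reads: iterable of (pool_index, sequence) tuples.
    pool_index_to_pool: dict mapping postenrichment index -> pool name.
    sample_barcode_to_sample: dict mapping pre-added in-read bar-code ->
        sample name (all bar-codes are assumed to have equal length).
    Returns a dict mapping (pool, sample) -> list of bar-code-trimmed
    sequences; reads that cannot be assigned are collected under key None.
    """
    barcode_len = len(next(iter(sample_barcode_to_sample)))
    bins = defaultdict(list)
    for pool_index, seq in reads:
        pool = pool_index_to_pool.get(pool_index)
        sample = sample_barcode_to_sample.get(seq[:barcode_len])
        if pool is None or sample is None:
            bins[None].append(seq)  # unknown index or bar-code
        else:
            bins[(pool, sample)].append(seq[barcode_len:])
    return dict(bins)
```

Counting the reads in each bin gives a direct check on how evenly the samples are represented in the final pool; a large unassigned bin would point to bar-code sequencing errors, which this exact-match sketch does not attempt to correct.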

The task of designing the target region is relatively 
straightforward, and this can be managed with web- 
based tools offered by UCSC, Ensembl/BioMart, 
etc. and spreadsheet calculations (e.g. Excel) on a 
personal computer. Web-based tools like MOPeD
offer a more user-friendly approach for oligonucleotide
probe design [17]. Far more difficult, however, is
the final sequence output analysis, which needs dedi- 
cated computer hardware and software. Fortunately, 
great progress has recently been made in read map- 
ping and parameter selection for this process, leading 
to more consistent and higher quality final results 
[18]. Reads generated by hybrid selection will always
tend to extend into sequences beyond the
target region, and the longer the fragment library is,
the more of these 'near target' sequences will be
recovered. Therefore, read mapping must start with a
basic decision regarding the precise definition of the
on/off target boundaries, as this parameter is used for
counting on/off target reads and so influences the
number of sequence reads considered as on target.
This problem is not so critical for enrichments based
on PCR and circularization, as these methods do not
suffer from 'near target' products. Another major
consideration in data analysis is the coverage needed
to reliably identify sequence variants, e.g. single
nucleotide polymorphisms (SNPs). This depends on
multiple factors, such as the nature of the region of
interest in question and the method used for targeted
enrichment. In different reports, it has ranged from
8x coverage [19], which was the minimum coverage
for reliable SNP calling, up to 200x coverage
[20], in this case the total average coverage for the
targeted region.
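
The effect of the on/off target boundary definition on read counts can be made concrete with a small sketch. It assumes that mapped reads and target intervals are available as (chromosome, start, end) tuples in 0-based, half-open coordinates, and it introduces a hypothetical `flank` parameter that widens each target on both sides to admit 'near target' reads. The linear scan over all targets is chosen for clarity, not speed.

```python
def count_on_off_target(reads, targets, flank=0):
    """Count reads overlapping any target interval, widened by `flank` bases.

    reads, targets: lists of (chrom, start, end) tuples, 0-based half-open.
    Returns (on_target, off_target) read counts.
    """
    on_target = 0
    for chrom, r_start, r_end in reads:
        overlaps = any(
            chrom == t_chrom
            and r_start < t_end + flank
            and r_end > t_start - flank
            for t_chrom, t_start, t_end in targets
        )
        if overlaps:
            on_target += 1
    return on_target, len(reads) - on_target
```

Running the same read set with `flank=0` and with a flank of roughly one fragment length shows how strongly a reported specificity depends on this single choice, which is why the boundary definition should be stated alongside any on-target figure.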

Enrichment by hybrid capture 

Enrichment by hybrid capture (Figure 1.1a and b) 
builds on know-how developed over the decade or 
more of microarray research that preceded the NGS 
age [21, 22]. The hybrid capture principle is based 
upon the hybridization of a selection 'library' of very 
many fragments of DNA or RNA representing the 
target region against a shotgun library of DNA frag- 
ments from the genome sample to be enriched. Two 
alternative strategies are used to perform the hybrid 
capture: (i) reactions in solution [4] and (ii) reactions 
on a solid support [3]. Each of these two approaches 
brings different advantages, as listed in Table 1. 

Selection libraries for hybrid capture are typically 
produced by oligonucleotide synthesis upon microarrays,
with lengths ranging from ~60 to ~180 bases.
These microarrays can be used directly to perform 
the hybrid capture reaction (i.e. surface phase meth- 
ods), or the oligonucleotide pool can be harvested 
from the array and used for an in-solution targeted 
enrichment (i.e. solution phase methods). The detached
oligonucleotide pool enables versatile downstream
processing: if universal 5'- and 3'-end
sequences are included in the design of the
oligonucleotides, the pool can be reamplified by PCR and
used to process many genomic samples. Furthermore, 
it is possible to introduce T7/SP6 transcription start 
sites via these PCRs [23], so that the pool can be 
transcribed into RNA before being used in an en- 
richment experiment. 

Recently, an increasing number of protocols and
vendors have begun offering out-of-the-box solutions
for hybrid capture, meaning that the researcher need
not do development work but merely choose between
preset targeted enrichment regions (e.g.
whole exome) or specify their own custom enrichment
region. Example vendors include Agilent
(SureSelect product), NimbleGen (SeqCap EZ product),
FlexGen and MYcroarray. Alternatively, a
more cost-efficient option than buying a
complete kit involves ordering a synthetic bait
library, reamplifying this by PCR [24], optionally
transcribing this into RNA and undertaking a
do-it-yourself enrichment experiment based upon
published protocols.

Enrichment by circularization 

Enrichment by DNA fragment circularization is
based upon the principle of selector probes [6, 25]
and gap-fill padlock probes or MIPs [26]. This approach
differs significantly from the aforementioned
hybrid capture method. Most notably, it is
greatly superior in terms of specificity, but far less 
amenable to multiple sample co-processing in a 
single reaction. Each probe used for enrichment by 
circularization comprises a single-stranded DNA 
oligonucleotide that at its ends contains two se- 
quences that are complementary to noncontiguous 
stretches of a target genomic fragment, but in re- 
versed linear order. Specific hybridization between 
such probes and their cognate target genomic frag- 
ments generates bipartite circular DNA structures. 
These are then converted to closed single-stranded cir- 
cles by gap filling and ligation reactions (Figure 1.2). A 
rolling circle amplification step or a PCR directed 
toward sequences present in the common region of 
all the circles is then finally applied to amplify the 
target regions (circularized sequences) to generate an 
NGS library. 

Variations on this basic method concept exist, in 
particular with regard to the differences in sample 
material preparation and downstream processing for 
NGS library preparation. In the gap-fill padlock or 
MIPs implementation (Figure 1.2a), the sample 
DNA is fragmented by shearing and used in the bi- 
partite circular structure to provide a template for the 
probe DNA to be extended by gap filling and con- 
verted to a closed circle. In this incarnation, the de- 
sign of the MIPs merely has to consider the 
uniqueness of each target region fragment and the 
most suitable hybridization conditions. In contrast, a
more elaborate design is offered by the 'Selector
Probe' technique [6, 27]. Here the genomic DNA
is fragmented in a controlled manner by means of a
cocktail of restriction enzymes, and the selector
probes are designed to accommodate the restriction
pattern of the target region. The ends of each genomic
DNA fragment thus become adjacently positioned in the
bipartite circles, enabling them to be gap filled and
ligated into closed single-stranded circles (Figure
1.2b).

A particularly appealing feature of enrichment by 
circularization with MIPs and selectors is their
'library free' nature [28]. Since MIPs and selectors
comprise a target-specific 5'- and 3'-end with a
common central linker, the sequencing primer information
for NGS applications can be directly included
into this common linker. Burdensome NGS library 
preparations are therefore not required, reducing 
processing time markedly. 

Enrichment by PCR 

Enrichment by PCR (Figure 1.3a-c) is, in terms of
methodology, a more straightforward method than
the other genome partitioning techniques.
It takes advantage of the great power of PCR to
enrich genome regions from small amounts of
target material. Just as for circularization methods,
if the PCR product sizes fall within the sequencing
length of the applied NGS platform (maximum read
length for SOLiD: 110 bp, Illumina: 240 bp and 454:
1000 bp), PCR-based enrichment can allow one to
bypass the need for shot-gun library preparation by
using suitably 5'-tailed primers in the final amplification
steps.

The main downside of the method is that it does
not scale easily, in any format, to enable the targeting
of very large genome subregions or many DNA samples.
To use this method effectively, several points must be
considered. First, any significant extent of parallelized
singleplex or multiplex PCR would need to be supported
by the use of automated robotics. Second, individual PCR
amplicons (or multiplex products) need to be carefully
normalized to equivalent molarities when pooling in
advance of NGS, so that the final coverage of the total
region of interest is as even as possible. Third, the
amount of DNA material a study requires can be
substantial, as this requirement grows linearly with the
number of PCR reactions used. But if the target region is small,
PCR can be the method of choice. For example, a
target region of 50-100 kb or so could be spanned
by a handful of long-range PCRs each of 5-10 kb
[29], or by tiling a few hundred shorter PCRs and
using microtiter plates and robotics, or by one or
other approaches toward PCR multiplexing [30, 31].

Long-range PCR is the most commonly applied 
approach and it is reasonably straightforward to ac- 
complish. Many vendors now offer specially formu- 
lated kits (e.g. Invitrogen SequalPrep, Qiagen 
SeqTarget) that can amplify fragments of up to 
20 kb in length. And obviously, this approach is 
fully compatible with automation. Long-range 
PCR products do, however, have to be cleaned, 
pooled and processed for shot-gun library prepar- 
ation so that they are ready for analysis by NGS. 

To increase the throughput of PCR while keeping
the number of PCR reactions as low as possible, there
is the alternative of multiplex PCR (Figure 1.3b).
Given careful primer design and reaction optimization,
several dozen primer pairs can be used together
effectively in a multiplexing reaction [32]. Indeed,
software specifically created to help with multiplex
PCR assay design is available [33]. Then, by running
many such reactions in parallel, many hundreds of
different DNA fragments can be amplified. An alternative
method that is commercially available from
Fluidigm (Table 1) uses a microfluidic PCR chip
to conduct several thousand singleplex PCRs in
parallel.

Yet another strikingly elegant method is the
micro-droplet PCR technology developed by
RainDance [34, 35]. Here, two libraries of lipid-encapsulated
water droplets are prepared: one in
which each droplet contains a small amount of the 
test sample DNA and the other comprising droplets 
that harbor distinct pairs of primers. These two 
libraries are then merged (respective droplet pairs 
are fused together) to generate a highly multiplexed 
total emulsion PCR wherein each reaction is actually 
isolated from all others in its own fused droplet 
(Figure 1.3c). Using this technology, up to 20 000 
primer pairs can be used effectively in parallel in a 
single tube. 

Overall, one can draw the following conclusions
from a comparison of the currently used enrichment
techniques shown in Table 1: (i) Hybrid capture
has its main advantages for medium to large target
regions (10-50 Mb), in contrast to the other two
approaches, which typically only target small regions
in the kilobase to low megabase
range. The ability to enrich for megabase-sized
targets is particularly advantageous in research studies
where typically whole exomes or many genes are
involved. For clinical applications, this
may be especially relevant in oncology, where
one would expect to sequence 100s to 1000s of genes.
(ii) The advantage of PCR and circularization-based
methods is that they achieve very high enrichment
factors and few off-target reads, but only for small
target regions. This is more suited to clinical genetics,
where typically only a few critical loci need to be
assessed.

Descriptive metrics for targeted DNA 
enrichment experiments 

To allow meaningful comparison of enrichment 
methods and experiments that employ them, and 
to rationally decide which technologies are most 
suitable when designing a research project, it is important
that an objective set of descriptive metrics is
defined and then widely used when reporting enrichment
datasets. A series of metrics needs to be
considered, and the importance of each can be
weighted according to the specific needs and objectives
of any experiment. A proposal for such a set of metrics
is soon to be published, and it contains the following
(Nilsson et al., manuscript in preparation):

(i) Region of interest (size): ROI; 

(ii) Average read depth (in ROI): D; 

(iii) Fraction of ROI sufficiently covered (at a spe- 
cified D): F; 

(iv) Specificity (fraction of reads in ROI): S; 

(v) Enrichment Factor (D for ROI versus D for rest 
of genome): EF; 

(vi) Evenness (lack of bias): E; and

(vii) Weight (input DNA requirement): W. 
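
Several of the metrics listed above can be derived directly from per-base coverage data, as the following sketch illustrates. It is an interpretation made for this example rather than the definition used in the cited proposal: specificity is approximated here from sequenced bases instead of read counts, the depth threshold for F defaults to 20x, and evenness (E) and input DNA weight (W) are omitted because the text does not give formulas for them. All names are hypothetical.

```python
def enrichment_metrics(roi_depths, off_target_bases, genome_size, min_depth=20):
    """Compute ROI, D, F, S and EF from simple coverage summaries.

    roi_depths: list with the read depth at every base of the ROI.
    off_target_bases: total sequenced bases that mapped outside the ROI.
    genome_size: genome size in bases (same unit as the ROI length).
    """
    roi = len(roi_depths)
    on_target_bases = sum(roi_depths)
    depth = on_target_bases / roi  # D: average read depth in the ROI
    fraction_covered = sum(1 for d in roi_depths if d >= min_depth) / roi  # F
    # S: approximated as the fraction of sequenced bases inside the ROI
    specificity = on_target_bases / (on_target_bases + off_target_bases)
    # EF: average depth in the ROI versus average depth in the rest of the genome
    rest_depth = off_target_bases / (genome_size - roi)
    enrichment = depth / rest_depth if rest_depth else float("inf")
    return {
        "ROI": roi,
        "D": depth,
        "F": fraction_covered,
        "S": specificity,
        "EF": enrichment,
    }
```

Reporting these values together, with the depth threshold used for F stated explicitly, is what makes datasets from different methods comparable.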

A theoretical examination of how a method's 
innate enrichment capability and the size of the tar- 
geted region work together to determine other par- 
ameters (such as specificity and read depth) can be 
very instructive when choosing an enrichment 
method for a particular application. This is illustrated 
in Figure 2. For example, given a method's specific 
enrichment factor and knowledge about the size of 
the region of interest, the corresponding sequencing 
effort can be estimated for a given desired specificity 
(percent of sequences on target). Similarly, for a 
given region of interest and a minimum desired spe- 
cificity, the necessary enrichment factor capabilities 
can be calculated. 
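
These estimates can be sketched numerically. The relations used below follow our reading of the Figure 2 legend, namely percent on target = 100 x EF x ROI / (EF x ROI - ROI + genome) and sequencing depth = (pot / 100) x seq per run / ROI, with all sizes in kb; the division by 100 and the inverted formula for the required enrichment factor are derived here for illustration and should be treated as assumptions. Function names are hypothetical.

```python
def percent_on_target(ef, roi_kb, genome_kb):
    """Percent of sequence on target for a given enrichment factor."""
    return 100.0 * ef * roi_kb / (ef * roi_kb - roi_kb + genome_kb)


def sequencing_depth(pot, seq_per_run_kb, roi_kb):
    """Average depth over the ROI for a given percent on target (pot)."""
    return (pot / 100.0) * seq_per_run_kb / roi_kb


def required_enrichment_factor(pot, roi_kb, genome_kb):
    """Enrichment factor needed to reach a desired percent on target.

    Obtained by solving percent_on_target() for ef; requires 0 <= pot < 100.
    """
    p = pot / 100.0
    return p * (genome_kb - roi_kb) / (roi_kb * (1.0 - p))
```

For example, with no enrichment (EF = 1) the expression reduces to the fraction of the genome occupied by the ROI, so a 3 Mb target in a 3 Gb genome yields 0.1% of sequence on target.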

Finally, the specific per sample cost of a targeted
enrichment is useful to consider. To make costs
comparable, either for different target region sizes or
across different methods, the costs can be normalized
as costs per base pair. Costs also change with time
and as technologies improve, and so at some stage
the overall price of any particular experiment (i.e.
targeted enrichment plus sequencing costs) will no
longer be cheaper than the alternative of whole-genome
sequencing combined with in silico-based isolation
of the region of interest.

Figure 2: Comparison of enrichment factor calculations on sequencing depth and percent on target sequences for
different target region sizes employed for targeted enrichment. Plot labels: sequence depth vs enrichment factor,
assuming a 1 Gbp sequencing run (solid line); enrichment factor (dashed line). Calculations were performed as follows:
percent on target sequences = $100 \times EF \times ROI / (EF \times ROI - ROI + genome)$ [52]; sequencing depth = $(pot / 100) \times seq\ per\ run / ROI$. EF, enrichment factor;
ROI, region of interest in kb; genome, genome size in kb; pot, percent on target sequences; seq per run, assumed
sequences per run in kb.

DISCOVERY OF GENETIC 
VARIATION 

To investigate genetic variation by NGS, many 
DNA samples need to be tested. To reduce the 
cost of such studies, researchers typically focus their 
attention on genome subregions of particular inter- 
est, and this implies a major role for targeted enrich- 
ment in such undertakings. A set of concerns then 
arises regarding the accuracy of variation discovery
within NGS data obtained from DNA that has been
subjected to one or another enrichment method.
Other questions, such as whether the input genomic 
DNA was also preamplified by WGA, whether sam- 
ple pooling or multiplexing was applied and whether 
proper experimental controls were employed, also 
come into play. Currently, however, the field is lack- 
ing a complete understanding of all the issues and 
influences relevant to these important questions. 
For these reasons, it is critical that thorough down- 
stream validation experiments are performed, using 
independent experimental approaches. 

Another dimension to the problem of reliably dis- 
covering sequence variation, and one where there is 
perhaps a little more clarity, is the impact of different 
software and algorithm choices used for primary se- 
quence data analysis (e.g. the choice of suitable gen- 
ome alignment tool, filter parameters for the analysis, 
coverage thresholds at intended bases). It has been 
shown that the detection of variants depends strongly 
on the particular software tools employed [36].
Indeed, because current alignment and analytical tools 
perform so heterogeneously, the 1000 Genomes 
Project Consortium [37] decided to avoid calling 
novel SNPs unless they were discovered by at least 
two independent analytical pipelines. In general, uni- 
fied analysis workflows can and must be developed 
[38] to enable the combination and processing of 
data produced from different machines/approaches, 
to at least minimize instrument-specific biases and 
errors that otherwise detract from making high- 
confidence variant base calls. 
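
The principle of accepting only those variants found by at least two independent pipelines can be expressed compactly. The sketch below is illustrative and is not the procedure of any particular consortium: it assumes that each pipeline's calls have already been normalized to comparable (chromosome, position, reference allele, alternative allele) tuples, which in practice is a nontrivial step.

```python
from collections import Counter


def concordant_calls(*call_sets, min_pipelines=2):
    """Return the variants reported by at least `min_pipelines` call sets.

    Each call set is an iterable of hashable variant keys, e.g.
    (chrom, pos, ref, alt). Duplicates within one call set count once.
    """
    counts = Counter(v for calls in call_sets for v in set(calls))
    return {v for v, n in counts.items() if n >= min_pipelines}
```

Variants that appear in only one call set are not necessarily false, so they are better kept as a separate list for validation than simply discarded.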

Whatever mapping and analysis approach is
applied, sufficient coverage at single-base resolution,
ranging from 20x to 50x, is usually deemed
necessary for reliable detection of sequence variation
[39-42]. In one simulation study, the SNP discovery 
performances of two NGS platforms in a specific 
disease gene were shown to fall rapidly when the 
coverage depth was below 40x [43]. In addition, 
all called variants should ideally be supported by 
data from both read orientations (forward and re- 
verse). Some researchers further insist on obtaining 
at least three reads from both the forward and the 
reverse DNA strands (double-stranded coverage) for 
any nonreference base before it is called [20]. Such 
stringent quality control practices are surely needed 
to minimize error rates and the impact of random 
sampling variance, so that true variations and sequen- 
cing artifacts can be resolved and homozygous and 
heterozygous genotypes at sites of variation reliably 
scored. 
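
The coverage and strand-support criteria described above translate into a simple filter, sketched below. The record layout (a dictionary with `depth`, `fwd_alt` and `rev_alt` keys) is a hypothetical structure chosen for this example; the defaults of 20x minimum depth and three nonreference reads per strand are taken from the ranges quoted in the text and would need tuning for a real dataset.

```python
def supported_on_both_strands(fwd_alt, rev_alt, min_per_strand=3):
    """True if the nonreference base is seen often enough on each strand."""
    return fwd_alt >= min_per_strand and rev_alt >= min_per_strand


def filter_candidates(candidates, min_depth=20, min_per_strand=3):
    """Keep candidate variants that pass depth and strand-support filters.

    candidates: list of dicts with keys 'depth' (total reads at the site),
    'fwd_alt' and 'rev_alt' (nonreference reads on each strand).
    """
    kept = []
    for cand in candidates:
        if cand["depth"] < min_depth:
            continue  # too little coverage for a reliable call
        if not supported_on_both_strands(
            cand["fwd_alt"], cand["rev_alt"], min_per_strand
        ):
            continue  # nonreference base lacks double-stranded support
        kept.append(cand)
    return kept
```

Note that amplicon-based enrichment can produce reads from only one orientation at some positions, in which case a strict double-stranded requirement may need to be relaxed for those sites.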

Deep coverage alone does not always seem to
be sufficient for accurate variation discovery.
For example, a naive Bayesian model for SNP call- 
ing — even with deep coverage — can lead to consid- 
erable false positive rates [38]. Thus, other stringent 
filtering parameters should also be applied, such as 
filtering out SNP calls that occur at positions with 
too great a coverage [44], e.g. on positions where 
massive pile-ups are found which are either sequen- 
cing or mapping artifacts. Increasing the number of 
sequenced samples (individually or multiplexed) may 
also result in more power to confidently call vari- 
ations [45]. For instance, applying an index-based 
multiplexed targeted sequencing approach would 
remove run-to-run biases and in turn facilitate cal- 
culating error estimates for genetic polymorphism 
detection [46]. Computing inter-sample concord- 
ance rates at each base provides yet another way to 
highlight sequencing errors. Sometimes, manual read
inspection is necessary to refine SNP calls, but this is
time consuming unless it can be partially automated. 
Other useful strategies include applying index-based 
sample multiplexing, processing controls of known 
sequence (e.g. HapMap DNAs) and testing parent- 
child trios. These 'multisample' approaches allow 
one to estimate genotype concordance rates, detect 
Mendelian errors and measure allelic bias at hetero- 
zygous sites. This latter problem (systematic distor- 
tions in the recovery of one nucleotide allele over 
another) could be due to a bias in the targeted en- 
richment process, in the preparation and amplifica- 
tion of the sequencing library, or during sequencing 
or postsequencing analysis [47]. 
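
As an example of such a multisample check, Mendelian errors in parent-child trios can be detected with a few lines of code. This sketch assumes diploid autosomal sites with unphased genotypes given as pairs of alleles; it treats any inconsistent site as an error, although a small number of true de novo mutations would also be flagged. The function names are hypothetical.

```python
def is_mendelian_consistent(child, mother, father):
    """True if the child genotype can be formed from one allele per parent.

    Genotypes are unphased pairs of alleles, e.g. ("A", "G").
    """
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mendelian_error_rate(trio_genotypes):
    """Fraction of sites whose trio genotypes violate Mendelian inheritance.

    trio_genotypes: iterable of (child, mother, father) genotype tuples.
    """
    sites = list(trio_genotypes)
    if not sites:
        return 0.0
    errors = sum(
        not is_mendelian_consistent(child, mother, father)
        for child, mother, father in sites
    )
    return errors / len(sites)
```

An error rate well above that expected from de novo mutation alone indicates genotyping problems, for instance heterozygous sites miscalled as homozygous because of allelic bias.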



CHALLENGES AND FUTURE 
PERSPECTIVES 

The main reason that targeted enrichment has been
developed as an adjunct for NGS in recent years is
that it was needed to make extensive sequencing
affordable for subregions of complex genomes. The
alternative of fully sequencing many complete genomes
to high average coverage (~30x or higher) to
enable things like genetic variation analysis was
simply not affordable. Another reason for choosing,
e.g. exome rather than whole-genome sequencing is
that the resulting data are simpler to interpret. This is
a crucial consideration, as it is generally much more
challenging to find the functional impact of variants
in noncoding regions. A comparison of today's costs
for whole-genome sequencing and targeted enrichment
is shown in Figure 3.

Current targeted enrichment methods are not yet
optimal, and must be improved if they are to remain
relevant for a long time to come. One fundamental
problem is the lack of evenness of coverage [48],
which is especially troublesome if the results are
intended for diagnostic purposes. Poor evenness across
regions with differing percentages of GC bases is a
general problem for NGS itself [2], and it directly
translates into lower coverage of promoter regions
and the first exons of genes, as these are often
GC-rich. Such problems are exacerbated by GC content
and other biases suffered by enrichment technologies.
Therefore, for reliable results, high coverage
is invaluable, but current methods for targeting
several megabase pairs might return only 60-80% of the
ROI at a read depth of over 40x, and 80-90% at
around 20x coverage.
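
The coverage figures quoted above can be derived from per-base read depths over the ROI. The sketch below assumes such depths are already available as a plain list of integers; the function names, the example depths and the 0.2 x mean cut-off used as an evenness measure are illustrative assumptions rather than a standard defined in this review.

```python
from statistics import mean


def fraction_covered(depths, min_depth):
    """Share of target bases with read depth >= min_depth."""
    if not depths:
        raise ValueError("no target bases supplied")
    return sum(d >= min_depth for d in depths) / len(depths)


def fraction_near_mean(depths, fold=0.2):
    """Share of target bases with depth of at least fold x mean depth.

    One simple, illustrative evenness measure: values close to 1.0
    indicate that few bases fall far below the average coverage.
    """
    if not depths:
        raise ValueError("no target bases supplied")
    threshold = fold * mean(depths)
    return sum(d >= threshold for d in depths) / len(depths)


# Hypothetical per-base depths for a ten-base target region.
example_depths = [0, 5, 18, 22, 25, 41, 45, 60, 75, 90]
print(fraction_covered(example_depths, 20))  # 0.7
print(fraction_covered(example_depths, 40))  # 0.5
print(fraction_near_mean(example_depths))    # 0.8
```

Reporting the covered fraction at several depth thresholds, rather than the mean depth alone, makes uneven enrichment visible even when the average coverage looks sufficient.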

Figure 3: Sequencing costs for short-read
next-generation sequencing. Since introduction in early
2008, costs have dropped radically and are represented as a
straight line on a logarithmic scale. The cost differential
between sequencing, e.g., a full human genome and a
human exome (plotted with already doubled coverage)
is the amount that can be spent on targeted enrichment.
Targeted enrichment therefore remains cost-efficient
overall as long as its costs stay within this cost space;
this holds especially for small target regions.
Data adapted from NHGRI [51].
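
The 'cost space' argument of Figure 3 can be written as simple arithmetic. The sketch below assumes that sequencing cost scales linearly with the number of bases sequenced; the parameter names are hypothetical, and the optional on-target fraction is an added assumption that accounts for reads wasted on off-target sequence. No actual prices are implied.

```python
def enrichment_budget(wgs_cost, genome_size, target_size,
                      coverage_factor=2.0, on_target_fraction=1.0):
    """Amount that can be spent on enrichment before WGS becomes cheaper.

    wgs_cost           -- cost of sequencing the whole genome (any currency)
    genome_size        -- genome length in bases
    target_size        -- total length of the targeted regions in bases
    coverage_factor    -- extra depth sequenced for the target to compensate
                          for uneven coverage (2.0 = doubled coverage)
    on_target_fraction -- share of reads mapping to the target (assumption)
    """
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    targeted_sequencing_cost = (wgs_cost * (target_size / genome_size)
                                * coverage_factor / on_target_fraction)
    return wgs_cost - targeted_sequencing_cost


# Placeholder numbers only: a 30 Mb target in a 3 Gb genome.
budget = enrichment_budget(wgs_cost=10_000.0, genome_size=3.0e9,
                           target_size=30e6)
print(round(budget, 2))  # 9800.0
```

Under these assumptions, enrichment pays off whenever its per-sample cost is below the returned budget, which grows as the target region shrinks relative to the genome.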



The comparison of different genome partitioning 
methods in Figure 4 gives a real-world indication of 
how very divergent the results of the available meth- 
ods can be. Even for the same genetic locus, pro- 
cessed by the same people in the same laboratory, the 
different enrichment methods produce very different 
average coverage, evenness and specificity. All four 
hybrid capture methods, including three solution-phase
methods (homemade, Flexelect, SureSelect)
and one solid-phase method (NimbleGen), show
considerable fluctuation in coverage over the targeted
region of interest. Depending on the fragment length
of the library, off-target sequences protrude
more or less into genomic regions adjacent to the
target region. In comparison, the SelectorProbe enrichment
shows more even coverage of the targeted
region, and its fluctuations in coverage are due to
the number of hybridization probes designed. The 
PCR-based enrichment (RainDance) results in the 
most even coverage across the targeted region, but 
this is flanked by the typical high coverage reads for 
the primer pair used for enrichment. 

For an improved understanding of many single-gene
disorders, targeted enrichment can help produce
a catalog of rare causative mutations by deeply
sequencing genomic loci of a large number of
patients. The analysis of genetic variation in complex
disease is not necessarily limited to human DNA 
but can also be applied in other health-relevant 
fields such as microbiology [49]. In principle, tar- 
geted enrichment in conjunction with NGS provides 
emerging possibilities in many areas relying on 
molecular-based technologies ranging from micro- 
bial testing to diagnostics [50]. 

Still, clinical diagnostic applications of sequencing 
where specific clinical questions need to be answered 
might favor analysis of only the relevant loci at high 
coverage. This has a number of advantages. First, a 
highly accurate answer is provided, which is required 
when clinicians take decisions about supplying or 
withholding expensive targeted biological drugs to, 
for instance, cancer patients. Second, a targeted sequencing
approach has the advantage of focusing
directly on the region of interest and therefore omitting
genomic information that is not directly relevant.
Third, an important point to consider is regulatory 
approval of further sequencing-based diagnostic tests. 
Given that regulatory approval is granted for a dedicated
and specific test that addresses a specific question,
a targeted sequencing approach might be more
acceptable to regulatory agencies. Hence, ultimately 
the adoption of enrichment methods in the sequen- 
cing field may evolve differently in the research and 
diagnostics fields. Indeed, the future use of sequen- 
cing for diagnostics may naturally move toward a 
'single cartridge per patient' approach, as is the cur- 
rent practice for other types of molecular diagnostics. 

Looking to the future, whole-genome sequen- 
cing will continue to become cheaper, simpler, and 
faster. This will steadily erode the rationale for using 
targeted enrichment rather than directly sequencing 
the complete genome and bioinformatically extracting
the sequences of interest. The long-term utility of
targeted enrichment will depend increasingly on progress
toward improved evenness and enrichment power
(to increase the value of the data), and
also on new and better strategies for sample multiplexing
and pooling (to bring down the per-sample
cost).
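
The effect of multiplexing on per-sample cost can be illustrated with a simple model. The sketch below assumes that indexed libraries are pooled before capture, so that one capture reaction and one sequencing run are shared by the whole pool; this cost model, the function names and all numbers are hypothetical.

```python
def per_sample_cost(library_prep, capture_reaction, sequencing_run, pool_size):
    """Cost per sample when capture and sequencing are shared by a pool.

    Library preparation is paid once per sample; the capture reaction and
    the sequencing run are split evenly across pool_size samples.
    """
    if pool_size < 1:
        raise ValueError("pool_size must be at least 1")
    return library_prep + (capture_reaction + sequencing_run) / pool_size


def max_pool_size(run_yield_bases, target_size, mean_depth, on_target_fraction):
    """Largest pool that still reaches mean_depth on the target per sample."""
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    bases_per_sample = target_size * mean_depth / on_target_fraction
    return int(run_yield_bases // bases_per_sample)


# Placeholder values: 30 Gb run yield, 30 Mb target, 100x depth, 50% on target.
pool = max_pool_size(30e9, 30e6, 100, 0.5)
print(pool)                                    # 5
print(per_sample_cost(50, 300, 1000, pool))    # 310.0
```

In this model, the achievable pool size is limited by the on-target fraction and by the depth required, which is why gains in enrichment power and evenness translate directly into a lower per-sample cost.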

In conclusion, with cheap third-generation sequencing
on the horizon, and with improvements in targeted
enrichment still occurring, the field of targeted
enrichment has not yet lost its raison d'être. Current
international large-scale sequencing projects like the
1000 Genomes Project [37] also rely on targeted
enrichment for NGS besides whole-genome sequencing
because the upfront expenses in sample
preparation are more than offset by a significantly
reduced total sequencing demand and reduced
downstream processing in terms of data analysis and
storage for generating high-coverage sequence data.